Take Home Exercise 3
Mini Case 3 of VAST Challenge 2023

pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, DT)
options(scipen = 999)
1. OVERVIEW
1.1 The Task
2. Datasets
3. Data Preparation
3.1 Install R-packages
p_load() of the pacman package (shown at the top of this page) is used to install, where necessary, and load the required R packages.
3.2 Importing Data
The JSON file is imported using fromJSON() of the jsonlite package.
MC3_challenge <- fromJSON("data/MC3.json")

Note that this is an undirected graph: an edge carries no direction from source to target, so directed = FALSE will be used when building the graph.
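As a minimal, self-contained sketch of what fromJSON() returns for a node-link JSON of this shape (toy values, not the actual MC3 file):

```r
library(jsonlite)

# A tiny node-link JSON with the same top-level shape as MC3.json (toy values)
toy_json <- '{
  "nodes": [{"id": "A", "type": "company"}, {"id": "B", "type": "person"}],
  "links": [{"source": "B", "target": "A", "type": "ownership"}]
}'

toy <- fromJSON(toy_json)

# fromJSON() simplifies each JSON array of objects into a data frame by default,
# so toy$nodes and toy$links can be converted to tibbles directly
names(toy)       # "nodes" "links"
nrow(toy$links)  # 1
```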
3.2.1 Extracting Edges
As the imported data file is a large list, we will extract the edges from MC3_challenge and save them as a tibble data frame called MC3_edges. The edges are processed in the following manner:

- distinct() is used to remove duplicated records
- mutate() and as.character() are used to convert the field data type from list to character
- group_by() and summarise() are used to count the number of unique links
- filter(source != target) is used to remove self-loops, where the source and target companies are identical
MC3_edges <- as_tibble(MC3_challenge$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()

3.2.2 Extracting Nodes
Similarly, we will extract the nodes from MC3_challenge and save them as a tibble data frame called MC3_nodes. The nodes are processed in the following manner:

- mutate() and as.character() are used to convert the field data type from list to character
- as.numeric(as.character()) is used to convert revenue_omu from list to character before converting it to numeric
- select() is used to reorganise the column sequence
MC3_nodes <- as_tibble(MC3_challenge$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)

4. Data Exploration
In this section, we will explore the nodes and edges data frame to identify aspects for data wrangling.
4.1 Exploring the edges data frame
skim() of the skimr package is used to display the summary statistics of the MC3_edges tibble data frame. As observed, there are no missing values in any field.
skim(MC3_edges)

| Name | MC3_edges |
|---|---|
| Number of rows | 24036 |
| Number of columns | 4 |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
datatable() of the DT package is used to display the MC3_edges tibble data frame as an interactive table.
DT::datatable(MC3_edges)

4.1.1 Plotting bar chart
ggplot(data = MC3_edges,
       aes(x = type)) +
  geom_bar()
4.2 Exploring the nodes data frame
skim() of the skimr package is used to display the summary statistics of the MC3_nodes tibble data frame. Unlike the edges, the nodes are not complete: revenue_omu is missing for 21515 records (a complete rate of only 0.22), while the character fields have no missing values.
skim(MC3_nodes)

| Name | MC3_nodes |
|---|---|
| Number of rows | 27622 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
DT::datatable(MC3_nodes)

4.2.1 Plotting the bar chart
ggplot(data = MC3_nodes,
       aes(x = type)) +
  geom_bar()
5. Network Visualization and Analysis
5.1 Building network model with tidygraph

Some of the edge endpoints do not appear in MC3_nodes. We therefore build a master node list from all distinct source and target values in MC3_edges, then join the node attributes to it before constructing the graph.
id1 <- MC3_edges %>%
  select(source) %>%
  rename(id = source)

id2 <- MC3_edges %>%
  select(target) %>%
  rename(id = target)

MC3_nodes_master <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes,
            unmatched = "drop")

MC3_graph <- tbl_graph(nodes = MC3_nodes_master,
                       edges = MC3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
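A minimal sketch of why the master node list is needed: some endpoints appear only in the edges, and left_join() fills their missing attributes with NA (toy ids and attributes, not the MC3 data):

```r
library(dplyr)

toy_edges <- tibble(source = c("A", "B"), target = c("C", "A"))
toy_nodes <- tibble(id = c("A", "C"), country = c("ZH", "Oceanus"))

# Stack all endpoints into one id column, as done with id1 and id2 above
toy_master <- bind_rows(toy_edges %>% select(id = source),
                        toy_edges %>% select(id = target)) %>%
  distinct() %>%
  left_join(toy_nodes, by = "id")

toy_master
# id "B" occurs only in the edges, so its country is NA
```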
MC3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
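To see what centrality_betweenness() measures, here is a minimal sketch on a hypothetical five-node path graph: the middle node lies on every shortest path between the two halves, so it receives the highest score.

```r
library(tidygraph)
library(dplyr)

# A five-node path 1-2-3-4-5; node 3 bridges the two halves
toy_edges <- data.frame(from = c(1, 2, 3, 4),
                        to   = c(2, 3, 4, 5))

toy_graph <- tbl_graph(edges = toy_edges, directed = FALSE) %>%
  mutate(betweenness = centrality_betweenness())

as_tibble(toy_graph)$betweenness  # 0 3 4 3 0
```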
6. Text Sensing with Tidytext

In this section, the tidytext package is used to perform basic text sensing on the product_services field of the nodes.
6.1 Simple Word count

str_count() of the stringr package is used to count the number of times "fish" appears in each product_services value.
MC3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish"))

# A tibble: 27,622 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashimi … Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conservati… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liability … Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
6.2 Tokenisation
Tokenisation refers to the process of breaking a given text into units called tokens, which can be individual words, phrases, or entire sentences. unnest_tokens() of tidytext is used here: the text extracted from product_services is unnested into the output column word.
token_nodes <- MC3_nodes %>%
  unnest_tokens(word,
                product_services)

6.2.1 Visualizing the Extracted words
token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
6.3 Removing stopwords

Stop words such as "the" and "of" carry little meaning on their own. anti_join() with the stop_words lexicon of tidytext is used to remove them from the tokens.

stopwords_removed <- token_nodes %>%
  anti_join(stop_words)
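A minimal sketch of how anti_join() drops stop words, using toy tokens and tidytext's built-in stop_words lexicon:

```r
library(dplyr)
library(tidytext)

toy_tokens <- tibble(word = c("the", "fish", "of", "seafood"))

# anti_join() keeps only the rows with no match in stop_words
kept <- toy_tokens %>%
  anti_join(stop_words, by = "word")

kept$word  # "fish" "seafood"
```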
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
Given that there are 7750 unique words in the word column, we will focus on the words that are related to illegal fishing.
length(unique(stopwords_removed$word))

[1] 7750
# create custom stop words vector
custom_stopwords <- c("food")

clean_text <- stopwords_removed %>%
  filter(!word %in% custom_stopwords) %>%
  filter(!grepl("\\d", word)) %>%  # remove tokens containing numbers
  filter(grepl("fish", word) | grepl("seafood", word) | grepl("shrimp", word))
  # mutate(category = case_when(
  #   grepl("products", word, ignore.case = TRUE) ~ "Products",
  #   grepl("services", word, ignore.case = TRUE) ~ "Services",
  #   TRUE ~ "Others"))
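A minimal sketch of the three filters on a toy word vector (hypothetical tokens):

```r
# Toy tokens illustrating the custom-stopword, digit, and keyword filters
words <- c("fish", "fishing", "seafood", "shrimp", "food", "fish2020")

keep <- !(words %in% c("food")) &   # drop custom stop words
        !grepl("\\d", words) &      # drop tokens containing digits
        (grepl("fish", words) |     # keep fishing-related tokens
         grepl("seafood", words) |
         grepl("shrimp", words))

words[keep]  # "fish" "fishing" "seafood" "shrimp"
```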
clean_text %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")